Let's explore some NBA player stats:
In [1]:
import numpy as np
import csv

class Rows:
    """Helper class for dealing with small data sets read in by csv
    """
    def __init__(self, fname):
        with open(fname, 'r') as f:
            reader = csv.reader(f)
            self.colnames = next(reader)  # first line holds the column names
            self.rows = list(reader)

    def col(self, *indices, conv=True):
        def col_val(c):
            # An empty cell marks a missing value.
            if len(c) == 0:
                return None
            return float(c) if conv else c

        def row_val(row):
            v = [col_val(row[index]) for index in indices]
            return v

        return [row_val(row) for row in self.rows]

    def col_by_name(self, *colnames, conv=True):
        return self.col(*(self.colnames.index(colname) for colname in colnames), conv=conv)

    def clean_rows(self, *colnames, conv=True):
        """
        Returns rows filtering out any row where any column is missing.
        """
        row_vals = self.col_by_name(*colnames, conv=conv)
        # Generator: it can be iterated only once.
        clean_row_vals = (row_val for row_val in row_vals if all(c is not None for c in row_val))
        return clean_row_vals

player_stats = Rows('2014-15-player-per-game-averages.csv')
player_stats.colnames
Out[1]:
In [2]:
player_stats.col_by_name('pts')[:10]
Out[2]:
In [3]:
player_stats.col_by_name('height_inches')[:10]
Out[3]:
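To make the missing-value handling concrete, here is a minimal standalone sketch of the same filtering idea. The three-row table and its names are made up for illustration (they are not taken from the CSV above), and the function is a simplified stand-in for `Rows.clean_rows`, not the class itself.

```python
import csv
import io

# Hypothetical sample table: rows B and C each have one blank cell.
SAMPLE = """name,height_inches,reb
A,80,7.5
B,,3.1
C,75,
"""

def clean_rows(text, *colnames):
    """Yield lists of floats for the named columns, skipping rows with blanks."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    indices = [header.index(name) for name in colnames]
    for row in reader:
        cells = [row[i] for i in indices]
        # Keep the row only if every requested cell is non-empty.
        if all(cell != "" for cell in cells):
            yield [float(cell) for cell in cells]

cleaned = list(clean_rows(SAMPLE, "height_inches", "reb"))
print(cleaned)
```

Only row A survives, because it is the only row with both a height and a rebound value.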
It would be fun to scatter plot different dimensions against each other, e.g. to see whether there's a clear relationship between height and rebounds per game. Let's build a helper function to plot.
In [4]:
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

def plot(rows, x_col_name, y_col_name):
    x_vals, y_vals = zip(*rows.clean_rows(x_col_name, y_col_name))
    plt.scatter(x_vals, y_vals)
    plt.xlabel(x_col_name)
    plt.ylabel(y_col_name)
    # Least-squares line through the points; r_value is the correlation coefficient.
    slope, intercept, r_value, p_value, std_err = stats.linregress(x_vals, y_vals)
    y_predicted = [intercept + slope*x for x in x_vals]
    # Use '-' rather than 'k-': 'k' asks for black and conflicts with color='red'.
    plt.plot(x_vals, y_predicted, '-', color='red')
    plt.show()
    return r_value

r_value = plot(player_stats, 'height_inches', 'reb')
r_value
Out[4]:
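The number returned by `plot` is the `r_value` from `stats.linregress`, i.e. the Pearson correlation coefficient of the two columns. As a rough sketch of what that fit computes, the slope, intercept, and r can be worked out by hand with NumPy. The `fit_line` helper and the five data points below are made up for illustration and are not from the data set.

```python
import numpy as np

def fit_line(x_vals, y_vals):
    """Ordinary least-squares line plus Pearson r, computed from deviations."""
    x = np.asarray(x_vals, dtype=float)
    y = np.asarray(y_vals, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    slope = (dx * dy).sum() / (dx ** 2).sum()
    intercept = y.mean() - slope * x.mean()
    # r rescales the same covariance term by both spreads, so it lies in [-1, 1].
    r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    return slope, intercept, r

# Hypothetical heights (inches) and rebounds per game.
heights = [72, 75, 78, 80, 83]
rebounds = [2.0, 3.5, 3.0, 6.0, 7.5]
slope, intercept, r = fit_line(heights, rebounds)
print(slope, intercept, r)
```

An r near 0 means the line explains little of the spread; an r near 1 or -1 means the points sit close to the line.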
So rebounding is weakly correlated with height. What about weight?
In [5]:
r_value = plot(player_stats, 'weight_lbs', 'reb')
r_value
Out[5]:
A little bit more predictive. Let's look at some more relationships for fun.
In [6]:
r_value = plot(player_stats, 'height_inches', 'weight_lbs')
r_value
Out[6]:
In [7]:
plot(player_stats, 'height_inches', 'blk')
Out[7]:
In [8]:
plot(player_stats, 'ast', 'stl')
Out[8]:
In [9]:
plot(player_stats, 'min', 'pts')
Out[9]:
Weight is strongly correlated with height, which makes sense: most NBA dudes are ripped and lean.
Let's look at some pairings I wouldn't expect to have much correlation.
In [10]:
plot(player_stats, 'pts', 'ast')
Out[10]:
In [11]:
plot(player_stats, 'stl', 'blk')
Out[11]:
Finally, let's take a look at some of these where we label the points by position; I bet we'll see a tighter fit within particular positions.
In [12]:
set([r[0] for r in player_stats.clean_rows('position', conv=False)])
Out[12]:
In [13]:
import collections

def plot_with_pos(rows, x_col_name, y_col_name):
    # Group the (x, y) pairs by position.
    by_pos = collections.defaultdict(list)
    for pos, x, y in rows.clean_rows('position', x_col_name, y_col_name, conv=False):
        by_pos[pos].append((float(x), float(y)))
    r_values = []
    # zip stops after three colors, so at most three positions are drawn.
    for color, (pos, values) in zip(('red', 'green', 'blue'), by_pos.items()):
        x_vals, y_vals = zip(*values)
        plt.scatter(x_vals, y_vals, color=color, label=pos)
        slope, intercept, r_value, p_value, std_err = stats.linregress(x_vals, y_vals)
        y_predicted = [intercept + slope*x for x in x_vals]
        # Use '-' rather than 'k-' so the line takes the position's color.
        plt.plot(x_vals, y_predicted, '-', color=color)
        r_values.append((pos, r_value))
    plt.xlabel(x_col_name)
    plt.ylabel(y_col_name)
    plt.legend(loc='upper left')
    plt.show()
    return r_values

plot_with_pos(player_stats, 'height_inches', 'reb')
Out[13]:
Interesting, so once you break it down by position, the correlation between height and rebounding disappears; being a tall guard apparently isn't really going to help you on the rebounding front. Or perhaps more likely, once you are tall, you are unlikely to be a guard.
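One way a pooled correlation can vanish within groups is when the groups differ in both variables but the variables are unrelated inside each group. The sketch below illustrates that with synthetic data only; the group means and spreads are invented for the example and say nothing about the real players.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic groups: within each group, rebounds are drawn independently of height.
guard_height = rng.normal(75, 2, n)
guard_reb = rng.normal(3, 1, n)
center_height = rng.normal(81, 2, n)
center_reb = rng.normal(8, 1.5, n)

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length arrays."""
    return np.corrcoef(x, y)[0, 1]

r_guards = pearson_r(guard_height, guard_reb)
r_centers = pearson_r(center_height, center_reb)
# Pooling mixes the two clusters, so the gap between groups shows up as correlation.
r_pooled = pearson_r(
    np.concatenate([guard_height, center_height]),
    np.concatenate([guard_reb, center_reb]),
)
print(r_guards, r_centers, r_pooled)
```

Here the pooled r comes almost entirely from the gap between the two clusters, which is one possible reading of the per-position plot above.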
Let's check out a few more.
In [14]:
plot_with_pos(player_stats, 'weight_lbs', 'reb')
Out[14]:
In [15]:
# reminding myself of the stats available
player_stats.colnames
Out[15]:
In [16]:
plot_with_pos(player_stats, 'ast', 'stl')
Out[16]:
The correlation of assists and steals is one of the few that actually holds up when you drill in across positions.
Let's end with some C->Q comparisons by looking at single stats across positions with side-by-side box plots.
In [17]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}
In [19]:
def box_plot_by_pos(rows, col_name):
    # Collect the values of one stat per position.
    by_pos = collections.defaultdict(list)
    for pos, val in rows.clean_rows('position', col_name, conv=False):
        by_pos[pos].append(float(val))
    plt.figure()
    plt.ylabel(col_name)
    pos_in_order = ['Guard', 'Forward', 'Center'] # plot looks better in this order (smallest to largest pos)
    # Note: Matplotlib 3.9 renamed the `labels` argument to `tick_labels`.
    plt.boxplot([by_pos[pos] for pos in pos_in_order], labels=pos_in_order)
    plt.show()

for el in ['height_inches', 'weight_lbs', 'min', 'pts', 'fg_pct', 'reb', 'ast', 'blk', 'stl']:
    box_plot_by_pos(player_stats, el)
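For reading these plots, it helps to know which numbers each box encodes. The following is a minimal sketch, assuming Matplotlib's default whisker setting of 1.5 times the interquartile range; `box_stats` is a hypothetical helper and the sample values are made up.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends, and outliers for one box of a box plot."""
    data = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    lo_fence = q1 - whis * iqr
    hi_fence = q3 + whis * iqr
    # Whiskers stop at the most extreme data points that are still inside the fences.
    inside = data[(data >= lo_fence) & (data <= hi_fence)]
    outliers = data[(data < lo_fence) | (data > hi_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": outliers.tolist(),
    }

# Hypothetical values with one obvious outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
summary = box_stats(sample)
print(summary)
```

For this sample the box spans 3 to 7 with the median at 5, the whiskers reach 1 and 8, and 100 is drawn as a separate outlier point.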